REPORT LAYOUT: - Introduction - General Analysis and insights (try to find unique insights) - Analysis of factors affecting revenue - ML and prediction of select movies

Executive Summary

Project Framework: Definition and Methodology

In this report, we will explore the various factors that impact and influence the monetary success of a movie at the box office. Our investigation extends beyond mere fiscal considerations, encompassing a nuanced analysis of important factors such as the expertise of the cast and crew. By scrutinizing these diverse components, this report aims to provide a comprehensive understanding of the factors that defines a movie’s monetary success at the box office.

The data was obtained through the use of our own web scraping algorithm and covers the top 75 grossing movies over the past 25 years.

Temporal Analysis

Over time, the average revenue demonstrates a distinct upward trend, with a notable observation regarding the rate of growth in Foreign revenue compared to Domestic revenue. The surge in global revenue is primarily driven by the rapid expansion of foreign revenue, highlighting the escalating growth and acceptance of Western films in international markets.

The onset of the Covid-19 Pandemic significantly impacted the film industry, evident in the graph. Productions were halted, and theaters closed, leading to a substantial loss of earning potential. The lockdown measures globally disrupted filming schedules, postponed releases, and the closure of theaters eliminated a crucial avenue for revenue. This had a ripple effect across the industry, affecting filmmakers, actors, crew members, distributors, and exhibitors. The industry’s vulnerability to external shocks became apparent, prompting the need for innovative adaptations to navigate the challenges such as online releases.

The impact of the month of film release is a fascinating observation. Notably, films hitting the screens in May and June consistently outperform those released in other months. Utilizing an analysis of variance (ANOVA) shows a significant disparity in average revenue across different release months. Several factors contribute to this phenomenon:

  1. Summer Blockbuster Season: May and June fall within the traditional summer movie season in numerous regions. Studios strategically unveil high-budget blockbuster films during this period, targeting a broad audience. The warmer weather and school vacations further boost movie attendance.

  2. Strategic Release Patterns: The film industry acknowledges this pattern, leading to a clustering effect. Recognizing the advantageous months, more popular and anticipated films tend to be strategically released during May and June. This intentional scheduling capitalizes on the observed heightened audience engagement during these months.

  3. Genre Preferences: Certain movie genres, such as action, adventure, and fantasy, are often associated with May and June releases as seen in the graph. These genres tend to draw larger audiences and generate higher revenue, contributing to the observed pattern. (Median Revenue used to account for outliers)

    term

    df

    sumsq

    meansq

    statistic

    p.value

    Month

    11

    8.114459e+18

    7.376781e+17

    10.76122

    0

    Residuals

    1848

    1.266797e+20

    6.854964e+16

    NA

    NA

Another notable observation is the seasonality exhibited in the average revenue over a year. The seasonal strength, quantified by a value of 0.5817723, signifies a substantial recurring pattern within our data set.

This strong seasonality implies that there are recurring trends or patterns in revenue that manifest on an annual basis. It suggests that certain times of the year consistently contribute to increased or decreased revenue. Understanding and leveraging this seasonality can be pivotal for strategic decision-making in the realm of film releases.

In practical terms, this finding prompts a closer examination of the temporal distribution of revenue throughout the year. A more detailed exploration of which months or seasons contribute significantly to high or low average revenues can unveil insights that may guide release strategies, marketing efforts, or resource allocation.

trend_strength seasonal_strength_year seasonal_peak_year seasonal_trough_year spikiness linearity curvature stl_e_acf1 stl_e_acf10
0.5208086 0.5817723 5 8 8.034234e+27 1142997602 -136943061 -0.0645241 0.0550351

Film Characteristics and Insights

Undoubtedly, a film’s budget stands as the biggest influencer of its success, controlling most aspects of production. A substantial financial backing allows for elevated production values, sophisticated marketing strategies, and the recruitment of established talent, all crucial elements that contribute to a film’s overall quality and marketability. This dynamic relationship is graphically portrayed by the slope of the regression line, emphasizing the large influence of budget and the multifaceted components shaping a film’s trajectory.

The genre of a film is a crucial aspect that defines its style, tone, and overall artistic expression. It serves as a blueprint, giving audiences a general idea of what to expect and helping filmmakers convey their vision effectively. The genre serves as a crucial component in the marketing and promotion of a film. It helps studios target specific demographics and tailor promotional campaigns to reach the intended audience.

As illustrated in the chart, Sci-Fi and Adventure emerge as the most lucrative genres within the film industry. This can be primarily attributed to the presence of many blockbuster titles within these specific genres. Despite the presence of outliers, which could represent exceptional cases or singular phenomena, the overarching trend reflected in the chart suggests a consistent and widespread favoritism towards Sci-Fi and Adventure genres. This pattern implies that audiences are consistently drawn to these genres, reinforcing their status as the forefront contributors to the film industry’s financial success.

Upon analyzing the revenue distributions across genres, a notable observation emerged: the box plots for the Fantasy and Family genres exhibited remarkable similarity. This observation prompted a deeper exploration into the correlations among various genre combinations.

A noteworthy finding was the high positive correlation between the Animation and Family genres. This correlation aligns seamlessly with the prevalent trend of animated family-oriented films. Conversely, an intriguing insight surfaced when examining the negative correlation between Thriller and Comedy genres. This distinctive relationship suggests an unconventional pairing that has not been extensively explored in the cinematic landscape.

This negative correlation sparks a thought-provoking notion — the potential for an innovative and revolutionary genre combination. The rarity of Thriller-Comedy hybrids in the current cinematic landscape presents an opportunity for filmmakers to experiment. This unexplored territory not only provides creative potential but also introduces the possibility of captivating a diverse audience with a novel cinematic experience.

Movies often navigate across various genres to broaden their appeal and cater to a diverse demographic. The trend indicates that a film’s performance tends to improve as it incorporates multiple genres, with the optimal balance appearing to be around six genres.

length <- df %>% 
  dplyr::select(Title, Worldwide, `Run Time (Mins)`, `Earliest Release Date`) %>% 
  dplyr::filter(`Earliest Release Date` >= "1999-01-01") %>% 
  dplyr::mutate(Length_Category = case_when(
    `Run Time (Mins)` < 90 ~ "Short",
    `Run Time (Mins)` < 120 ~ "Medium",
    TRUE ~ "Long")) %>% 
  dplyr::group_by(Length_Category) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n())

length$Length_Category <- factor(length$Length_Category, levels = c("Short", "Medium", "Long"))

ggplot(length, aes(x = Length_Category, y = AvgRev)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue", color = "black") +
  labs(title = "Average Revenue by Movie Length Category",
       x = "Length Category",
       y = "Average Revenue") +
  theme_minimal()

##Cast & Crew Analysis The composition of the cast is a crucial factor in the success of a film, where actors have the ability to either propel or hinder its success. While the quantification of an actor’s precise impact on a film’s success may pose a challenge, delving into the data reveals enlightening insights.

There appears to be a shift in audience preferences. The audience, while still embracing the familiarity of established actors, demonstrates a growing affinity for emerging talents. This graph also reveals the audience is now forging connections with a new cohort of actors who have become their own “regulars.”

When seeking the ideal star for an upcoming film, Robert Downey Jr. or Chris Pratt stands out. Both actors have established themselves as powerhouses of the new cohort of actors.

cast <- df %>% 
  dplyr::select(Title, Worldwide, Cast) %>% 
  tidyr::separate_rows(Cast, sep = ", ") %>% 
  dplyr::group_by(Cast) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::mutate(Movies_Acted = case_when(
    Count >= 5 & Count <= 10 ~ '5-10',
    Count > 10 & Count <= 15 ~ '10-15',
    Count > 15 & Count <= 20 ~ '15-20',
    Count > 20 ~ '20+',
    TRUE ~ 'Less than 5'
  )) %>% 
  dplyr::group_by(Movies_Acted) %>% 
  dplyr::summarise(AvgRev = mean(AvgRev),
                   Count = n()) %>% 
  dplyr::arrange(desc(AvgRev))

# Changing preference for newer faces or different types of story telling. Still like regulars 
range_order <- c("Less than 5", "5-10", "10-15", "15-20", "20+")
cast$Movies_Acted <- factor(cast$Movies_Acted, levels = range_order)
ggplot(cast, aes(x = Movies_Acted, y = AvgRev, fill = factor(Count))) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  scale_fill_viridis_d() +
  labs(title = "Average Revenue by Number of Movies Acted",
       x = "Movies_Acted",
       y = "Average Revenue",
       fill = "Count") +
  theme_minimal()

star <- df %>% 
  dplyr::select(Worldwide, Star) %>% 
  dplyr::group_by(Star) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::filter(Count >= 5) %>% 
  dplyr::arrange(desc(AvgRev))

star_plot <- star %>%  
  ggplot(aes(x = Count, y = AvgRev, size = Count, color = Count,
             text = paste("Star:", Star, "<br>Number of Movies:", Count, "<br>Average Revenue:", scales::dollar(AvgRev)))) +
  geom_point() +
  labs(title = "Movie Stars and Avg Revenue",
       x = "Number of Movies",
       y = "Average Revenue",
       size = "Number of Movies")
plotly::ggplotly(star_plot, tooltip = "text")
#RUN REGRESSION ANALYSIS

budget <- df %>% 
  dplyr::select(Title, Worldwide, Director, Writer, Budget, Cinematographer, `Production Designer`, Editor, Producer, Composer) %>% 
  stats::na.omit() %>% 
  dplyr::mutate(Director_Count = str_count(Director, ",") + 1,
                Writer_Count = str_count(Writer, ",") + 1,
                Cinematographer_Count = str_count(Cinematographer, ",") + 1,
                Prduction_Designer_Count = str_count(`Production Designer`, ",") + 1,
                Editor_Count = str_count(Editor, ",") + 1,
                Producer_Count = str_count(Producer, ",") + 1,
                Composer_Count = str_count(Composer, ",") + 1)

crew_lm <- lm(Budget ~ Director_Count + Writer_Count + Cinematographer_Count + Prduction_Designer_Count + Editor_Count + Producer_Count + Composer_Count, data = budget)

summary(crew_lm)
## 
## Call:
## lm(formula = Budget ~ Director_Count + Writer_Count + Cinematographer_Count + 
##     Prduction_Designer_Count + Editor_Count + Producer_Count + 
##     Composer_Count, data = budget)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -239562855  -37033548  -11285902   25621307  254699434 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               44426150   12053433   3.686 0.000238 ***
## Director_Count           -13864534    4922064  -2.817 0.004932 ** 
## Writer_Count               7212409     708813  10.175  < 2e-16 ***
## Cinematographer_Count     -3045291    7594342  -0.401 0.688497    
## Prduction_Designer_Count  13427453    6176014   2.174 0.029895 *  
## Editor_Count              17828736    2283569   7.807 1.29e-14 ***
## Producer_Count             -401127    1110251  -0.361 0.717945    
## Composer_Count           -10094512    3977026  -2.538 0.011271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52320000 on 1171 degrees of freedom
## Multiple R-squared:  0.1495, Adjusted R-squared:  0.1444 
## F-statistic: 29.41 on 7 and 1171 DF,  p-value: < 2.2e-16
revenue_lm <- lm(Worldwide ~ Director_Count + Writer_Count + Cinematographer_Count + Prduction_Designer_Count + Editor_Count + Producer_Count + Composer_Count, data = budget)

summary(revenue_lm)
## 
## Call:
## lm(formula = Worldwide ~ Director_Count + Writer_Count + Cinematographer_Count + 
##     Prduction_Designer_Count + Editor_Count + Producer_Count + 
##     Composer_Count, data = budget)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -714413866 -136594300  -67866018   55173394 2379594808 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              173252773   58382919   2.968  0.00306 ** 
## Director_Count           -45863170   23840883  -1.924  0.05463 .  
## Writer_Count              25272241    3433260   7.361 3.43e-13 ***
## Cinematographer_Count     16729738   36784527   0.455  0.64933    
## Prduction_Designer_Count  46089032   29914608   1.541  0.12366    
## Editor_Count              58584723   11060868   5.297 1.41e-07 ***
## Producer_Count           -14196367    5377694  -2.640  0.00840 ** 
## Composer_Count           -44948304   19263421  -2.333  0.01980 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253400000 on 1171 degrees of freedom
## Multiple R-squared:  0.08869,    Adjusted R-squared:  0.08324 
## F-statistic: 16.28 on 7 and 1171 DF,  p-value: < 2.2e-16
autoplot(revenue_lm, size = 0.5)

MACHINE LEARNING

library(dplyr)
library(parsnip)
library(ggplot2)
# Split data
set.seed(123)  # Set a seed for reproducibility
#df_shuffled <- df[sample(nrow(df)), ] # Shuffle data to elimate some bias hopefully/
df_split <- df %>% rsample::initial_split(prop = 0.80)
df_train <- rsample::training(df_split)
df_test  <- rsample::testing(df_split)

# Worldwide as dependent variable
recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_train) %>%
  # step_rm(date) %>%
  recipes::prep()

train_baked <- recipes::bake(recipe_pipeline, df_train)

recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_test) %>%
  # step_rm(date) %>%
  recipes::prep()

test_baked <- recipes::bake(recipe_pipeline, df_test)  # Corrected to use test_baked


common_levels_distributor <- intersect(levels(train_baked$Distributor), levels(test_baked$Distributor))
train_baked$Distributor <- factor(train_baked$Distributor, levels = common_levels_distributor)
test_baked$Distributor <- factor(test_baked$Distributor, levels = common_levels_distributor)



# Modeling
model <- parsnip::decision_tree(mode = "regression") %>%
  parsnip::set_engine("rpart") %>%
  parsnip::fit(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = train_baked)
model
## parsnip model object
## 
## n= 1492 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1492 1.122213e+20  261062300  
##    2) Budget< 1.445e+08 1335 5.514783e+19  215972100  
##      4) Budget< 8.9e+07 1083 2.741099e+19  179967600  
##        8) First Genre=Action,Animation,Biography,Comedy,Crime,Documentary,Drama,Fantasy,Horror,Mystery,Romance,Sci-Fi,Thriller 941 1.688246e+19  168897400 *
##        9) First Genre=Adventure 142 9.649017e+18  253327200  
##         18) Release Month=Aug,Dec,Feb,Jan,Mar,May,Nov,Oct,Sep 108 2.392568e+18  201788600 *
##         19) Release Month=Apr,Jul,Jun 34 6.058331e+18  417038300  
##           38) Distributor=DreamWorks Distribution,GKIDS,Sony Pictures Entertainment (SPE),Twentieth Century Fox,Walt Disney Studios Motion Pictures 20 6.195770e+17  219535500 *
##           39) Distributor=Universal Pictures,Warner Bros. 14 3.544112e+18  699185100 *
##      5) Budget>=8.9e+07 252 2.029945e+19  370705400  
##       10) First Genre=Biography,Comedy,Crime,Documentary,Drama,Fantasy,Horror 41 7.609659e+17  218163700 *
##       11) First Genre=Action,Adventure,Mystery 211 1.839907e+19  400346200  
##         22) Run Time (Mins)< 139 151 5.717508e+18  337193600 *
##         23) Run Time (Mins)>=139 60 1.056373e+19  559280300  
##           46) MPAA=R 14 3.150352e+17  270567100 *
##           47) MPAA=PG,PG-13 46 8.726558e+18  647149600  
##             94) Release Month=Aug,Jan,Jun,Sep 13 1.037251e+18  376371400 *
##             95) Release Month=Apr,Dec,Feb,Jul,Mar,May,Nov,Oct 33 6.360646e+18  753819800 *
##    3) Budget>=1.445e+08 157 3.127963e+19  644473100  
##      6) Budget< 2.185e+08 133 1.527063e+19  562482600  
##       12) Run Time (Mins)< 120.5 66 3.358576e+18  448968700 *
##       13) Run Time (Mins)>=120.5 67 1.022388e+19  674302200 *
##      7) Budget>=2.185e+08 24 1.016019e+19 1098837000  
##       14) Release Month=Jul,Jun,Mar,May,Nov,Oct 17 1.854293e+18  799961800 *
##       15) Release Month=Apr,Dec 7 3.099440e+18 1824678000 *
test_predictions <- predict(model, new_data = test_baked)
test_predictions
## # A tibble: 373 × 1
##         .pred
##         <dbl>
##  1 218163671.
##  2 201788551.
##  3 168897406.
##  4 168897406.
##  5 168897406.
##  6 168897406.
##  7 168897406.
##  8 219535487.
##  9 168897406.
## 10 168897406.
## # ℹ 363 more rows
# Plot tree
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
rpart.plot(
  model$fit,
  roundint = F,
  cex = 1,
  fallen.leaves = F,
  extra = "auto",
  main = "Regression Tree"
)

# Results
res <- model %>% predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% dplyr::select(Worldwide))
res %>% yardstick::metrics(truth = Worldwide, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   210140046.   
## 2 rsq     standard           0.307
## 3 mae     standard   129842382.
res %>% ggplot(aes(x = .pred, y = Worldwide)) + geom_point() +
  labs(title = "Prediction vs Actual",
       subtitle = "Decision Tree - Regression")

# Assuming 'model' is your decision tree model and 'df_test' is your testing data
library(parsnip)
library(yardstick)
library(vip)
## Warning: package 'vip' was built under R version 4.3.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
# Assess variable importance
importance <- model %>% vip()

# Assess variable relationships
plot(importance)

# Sampling
df_split2 <- df %>% rsample::initial_time_split(prop = 0.9999)
df_train2 <- rsample::training(df_split2)

df_train2 <- df_train2 %>% 
  dplyr::mutate(Director_Count = str_count(Director, ",") + 1,
                Writer_Count = str_count(Writer, ",") + 1,
                Genres_Count = count_genres,
                Primary_Genre = `First Genre`)
# Training Data
recipe_pipeline2 <- recipes::recipe(Worldwide ~ Budget + 
                                      Distributor + 
                                      `Earliest Release Date` + 
                                      MPAA + 
                                      `Run Time (Mins)` + 
                                      Star + 
                                      Director_Count + 
                                      Writer_Count +
                                      Genres_Count +
                                      Primary_Genre, 
                                    data = df_train2) %>%
  step_mutate(Release_Month = month(`Earliest Release Date`, label = TRUE)) %>%
  step_dummy(Star, one_hot = TRUE) %>%
  recipes::prep()

train_baked2 <- recipes::bake(recipe_pipeline2, df_train2)

# MODEL
lm_fit <- parsnip::linear_reg(mode = "regression") %>%
  parsnip::set_engine("lm") %>%
  parsnip::fit(Worldwide ~ ., data = train_baked2)

top_stars <- head(star$Star, 50)

ui <- fluidPage(
  titlePanel("Movie Revenue Prediction"),
  sidebarLayout(
    sidebarPanel(
      fluidRow(
        column(width = 6,
          textInput("budget", "Budget", value = ""),
          selectInput("distributor", "Distributor",
                      choices = c("Warner Bros.", "Universal Pictures", 
                                  "Walt Disney Studios Motion Pictures", 
                                  "Sony Pictures Entertainment (SPE)", 
                                  "Twentieth Century Fox", "Paramount Pictures", 
                                  "Lions Gate Films", "DreamWorks Distribution"), 
                      selected = "Warner Bros."),
          dateInput("release_date", "Release Date", value = ""),
          selectInput("mpaa_rating", "MPAA Rating",
                      choices = c("PG", "PG-13", "R"), selected = "PG-13"),
          numericInput("director_count", "# of Directors", value = 1)
        ),
        column(width = 6,
          numericInput("writer_count", "# of Writers", value = 1),
          selectInput("star", "Star", choices = top_stars, selected = top_stars[1]),
          numericInput("duration", "Duration (Mins)", value = 120),
          numericInput("genres_count", "# of Genres", value = 1),
          selectInput("primary_genre", "Primary Genre",
                      choices = c("Adventure", "Comedy", "Fantasy", "Animation", "Family", 
                                  "Biography", "Drama", "History", "Action", "Sci-Fi", 
                                  "Crime", "Mystery", "Thriller", "Musical", "Romance", 
                                  "Horror", "Sport", "Documentary", "Music", "War", 
                                  "Western", "Short"), 
                      selected = "Action")
        )
      ),
      actionButton("predict_button", "Predict Revenue")
    ),
    mainPanel(
      textOutput("prediction_output")
    )
  )
)

server <- function(input, output) {
  predicted_revenue <- eventReactive(input$predict_button, {
    new_data <- tibble(
      Budget = as.numeric(input$budget),
      Distributor = input$distributor,
      `Earliest Release Date` = as.Date(input$release_date),
      MPAA = input$mpaa_rating,
      Director_Count = as.numeric(input$director_count),
      Writer_Count = as.numeric(input$writer_count),
      Star = input$star,
      `Run Time (Mins)` = as.numeric(input$duration),
      Genres_Count = as.numeric(input$genres_count),
      Primary_Genre = input$primary_genre,
    )

    # Bake the new data using the updated recipe
    new_data_baked <- recipes::bake(recipe_pipeline2, new_data)

    # Make predictions using the trained model
    predictions <- predict(lm_fit, new_data_baked)

    # Return the predicted revenue as a numeric value
    as.numeric(predictions)
  })

  # Display the predicted revenue as a numeric value
  output$prediction_output <- renderText({
    predicted_revenue()
  })
}

# Run the Shiny app
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents